Let $\mathbf{x} \in \mathbb{R}^D$ denote an input vector (e.g. the image of a handwritten digit) and $y \in \{1,...,C\}$ the corresponding label (e.g. the digit class).
We assume that there exists a function $f(\cdot; \boldsymbol\theta): \mathbb{R}^D \to [0,1]^C$ parametrized by $\boldsymbol\theta \in \mathbb{R}^P$ such that:
$$p(y=k|\mathbf{x}; \boldsymbol\theta) = f_k(\mathbf{x}; \boldsymbol\theta), \qquad \forall k \in \{1,...,C\},$$where $f_k(\mathbf{x}; \boldsymbol\theta)$ is the $k$-th entry of $f(\mathbf{x}; \boldsymbol\theta) \in [0,1]^C$.
Our goal is to build a model $f(\mathbf{x}; \boldsymbol\theta)$ such that the prediction
$$ \hat{y} = \arg\max_{k \in \{1,...,C\}} p(y=k|\mathbf{x}; \boldsymbol\theta) = \arg\max_{k \in \{1,...,C\}} f_k(\mathbf{x}; \boldsymbol\theta)$$is as close as possible to the true label $y$.
Using the softmax function, our model $f_k(\mathbf{x}; \boldsymbol\theta)$ is defined as follows:
$$ f_k(\mathbf{x}; \boldsymbol\theta) = \frac{\exp(\boldsymbol\theta_k^\top \mathbf{x} + b_k)}{ \sum_{m=1}^C \exp(\boldsymbol\theta_m^\top \mathbf{x} + b_m)}, $$where $\boldsymbol\theta = \{\boldsymbol\theta_k \in \mathbb{R}^D, b_k \in \mathbb{R}\}_{k=1}^{C}$ are the model parameters.
This is the multinomial logistic regression model. Even though the name contains "regression", it is a model used for solving classification problems.
Multinomial logistic regression is considered a linear method, because the prediction is computed from an affine combination of the input features. The softmax function, which is itself nonlinear, is only introduced to ensure that $f_k(\mathbf{x}; \boldsymbol\theta) \ge 0$ and $\sum_{k=1}^C f_k(\mathbf{x}; \boldsymbol\theta) = 1$, the properties required to interpret the output as a probability distribution over the classes.
Note that the bias terms $b_k$ can be integrated into the weight vector $\boldsymbol\theta_k$ if we consider that the input vector $\mathbf{x}$ has an additional dimension equal to 1. The model thus simplifies to:
$$ f_k(\mathbf{x}; \boldsymbol\theta) = \frac{\exp(\boldsymbol\theta_k^\top \mathbf{x})}{ \sum_{m=1}^C \exp(\boldsymbol\theta_m^\top \mathbf{x})}. $$In the following, we assume that the data dimension $D$ already includes this additional constant dimension.
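As an aside, the softmax transformation itself takes only a few lines of numpy. The sketch below is purely illustrative (the function name softmax and the toy scores are ours, not part of the lab code); it subtracts the maximum score before exponentiating, a standard trick that leaves the result unchanged but avoids numerical overflow.
import numpy as np
def softmax(scores):
    """Map a vector of real-valued scores to a probability vector."""
    shifted = scores - np.max(scores)  # does not change the softmax, avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)
scores = np.array([2.0, 1.0, -1.0])   # e.g. the C = 3 values theta_k^T x for one input x
probs = softmax(scores)
print(probs, probs.sum())             # entries in [0, 1], summing to 1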
We assume that we have access to a training dataset of $N$ i.i.d. samples $\{(\mathbf{x}_i, y_i) \in \mathbb{R}^D \times \{1,...,C\}\}_{i=1}^N$. We are in a supervised learning framework.
The empirical risk is defined by:
$$ \mathcal{L}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, f(\mathbf{x}_i; \boldsymbol\theta)), $$where $\ell$ is a loss function measuring the discrepancy between the true label $y_i$ and the prediction $f(\mathbf{x}_i; \boldsymbol\theta)$. In multinomial logistic regression, we use the negative log-likelihood (NLL) as the loss function.
Let us define the one-hot vector $\mathbf{t}_i \in \{0,1\}^C$ encoding the true label $y_i$:
$$ t_{i,k} = [\mathbf{t}_i]_k = \begin{cases} 1 \hspace{.25cm} \text{if } y_i =k \\ 0 \hspace{.25cm} \text{otherwise } \end{cases}.$$It is just another representation of the label.
The NLL loss function is defined by:
$$ \begin{aligned} \ell(y_i, f(\mathbf{x}_i; \boldsymbol\theta)) &= - \ln p(\mathbf{x}_i,y_i; \boldsymbol\theta) \\ &= - \ln p(y_i | \mathbf{x}_i ; \boldsymbol\theta) - \ln p(\mathbf{x}_i) \\ &= - \ln \prod_{k=1}^{C} p(y_i=k | \mathbf{x}_i ; \boldsymbol\theta)^{t_{i,k}} + \text{cst}\\ &= - \ln \prod_{k=1}^{C} f_k(\mathbf{x}_i; \boldsymbol\theta)^{t_{i,k}} + \text{cst}\\ &= - \sum_{k=1}^{C} t_{i,k} \ln f_k(\mathbf{x}_i; \boldsymbol\theta) + \text{cst}, \end{aligned} $$where $\text{cst} = -\ln p(\mathbf{x}_i)$ does not depend on $\boldsymbol\theta$. Ignoring this constant, the complete loss function is therefore given by:
$$ \mathcal{L}(\boldsymbol\theta) = - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{C} t_{i,k} \ln f_k(\mathbf{x}_i; \boldsymbol\theta). $$This is called the cross-entropy loss function.
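As a purely illustrative numerical example (the values are made up): with $C = 3$ classes, a model output $f(\mathbf{x}_i; \boldsymbol\theta) = (0.7, 0.2, 0.1)$ and true label $y_i = 1$, i.e. $\mathbf{t}_i = (1, 0, 0)$, the loss on this single sample is
$$ \ell = - \left( 1 \cdot \ln 0.7 + 0 \cdot \ln 0.2 + 0 \cdot \ln 0.1 \right) = - \ln 0.7 \approx 0.357. $$It would be exactly $0$ only if the model assigned probability $1$ to the correct class.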
To estimate the model parameters $\boldsymbol\theta$, we have to minimize the loss function $\mathcal{L}(\boldsymbol\theta)$. To do so, we can use the gradient descent algorithm, an iterative method that repeats the update:
$$ \boldsymbol\theta \leftarrow \boldsymbol\theta - \eta \nabla \mathcal{L}(\boldsymbol\theta), $$where $\eta > 0$ is the learning rate. Both the learning rate and the initialization of the parameters have a critical influence on the behavior of the algorithm.
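To make the influence of the learning rate concrete, here is a small, self-contained sketch (not part of the lab code; the function gd_1d and its values are made up) that runs gradient descent on the toy function $g(\theta) = \theta^2$, whose gradient is $2\theta$:
def gd_1d(eta, n_steps=20, theta0=5.0):
    """Run n_steps of gradient descent on g(theta) = theta**2 (gradient: 2*theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2.0 * theta   # theta <- theta - eta * grad
    return theta
for eta in [0.01, 0.1, 1.1]:
    print(eta, gd_1d(eta))
With eta = 0.1 the iterates quickly approach the minimizer 0, with eta = 0.01 they move there very slowly, and for this toy function they diverge as soon as eta > 1.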
We will perform this update independently for all $\boldsymbol\theta_j \in \boldsymbol\theta$, $j \in \{1,...,C\}$, so we need to compute
$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol\theta_j} = - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{C} t_{i,k} \frac{\partial}{\partial \boldsymbol\theta_j} \ln f_k(\mathbf{x}_i; \boldsymbol\theta). $$You can provide handwritten answers; there is no need to write them in LaTeX (but you can if you want).
Question 1) Using the model's definition, give the expression of $\displaystyle \frac{\partial}{\partial \boldsymbol\theta_j} \ln f_k(\mathbf{x}; \boldsymbol\theta)$. You need to consider separately the cases where $j=k$ and $j \neq k$. Your answer will involve the indicator function $\mathbb{1}_{\{j=k\}}$ which is equal to $1$ if $j=k$, $0$ otherwise. You will use the chain rule of derivatives and the following formula:
$$ \frac{\partial}{\partial \mathbf{v}} \mathbf{v}^\top \mathbf{A} = \mathbf{A}, $$which holds for any vector $\mathbf{v}$ and matrix $\mathbf{A}$ of appropriate dimension.
Question 2) Show that $\displaystyle \frac{\partial}{\partial \boldsymbol\theta_j} \mathcal{L}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^N (f_j(\mathbf{x}_i; \boldsymbol\theta) - t_{i,j}) \mathbf{x}_i$ using the result of the previous question and the definition of $t_{i,k}$.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
def to_one_hot(y):
    y_one_hot = np.zeros((y.shape[0], y.max()+1))
    y_one_hot[np.arange(y.shape[0]), y] = 1
    return y_one_hot
def proba_to_class(y_hat):
    return np.argmax(y_hat, axis=1)
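As a quick, purely illustrative check of these two helpers on a toy label vector:
t = to_one_hot(np.array([2, 0, 1]))   # -> [[0,0,1],[1,0,0],[0,1,0]] (as floats)
print(proba_to_class(t))              # -> [2 0 1], recovering the original labels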
x_digits, y_digits = datasets.load_digits(return_X_y=True)
x_digits = x_digits / x_digits.max()
x_digits = np.hstack((x_digits, np.ones((x_digits.shape[0],1))))
x_train, x_test, y_train, y_test = train_test_split(x_digits, y_digits, test_size=0.10, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.10, random_state=42)
y_one_hot_train = to_one_hot(y_train)
y_one_hot_val = to_one_hot(y_val)
y_one_hot_test = to_one_hot(y_test)
We can look at the first 5 training images.
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(x_train[0:5,:-1], y_train[0:5])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
    plt.title('Training: %i\n' % label, fontsize=20)
Question 1) Look at the dimensions of the variables x_{train, val, test} and y_{train, val, test} and briefly describe the dataset.
Answer
Question 2) In the next cell, you have to complete the function prediction to implement the prediction model $f(\mathbf{x}; \boldsymbol\theta) \in [0,1]^C$, $\mathbf{x} \in \mathbb{R}^D$. This function takes as input an $(N, D)$ array x of input vectors and a $(C, D)$ array theta of model parameters. It should return an $(N, C)$ array of class probabilities. To sum over one dimension of a numpy array you will use numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>).
def prediction(x, theta):
    """
    Computes the multinomial logistic regression prediction
    Input: x (N, D) array
           theta (C, D) array where C is the number of classes
    Returns: prediction (N, C) array of class probabilities
    """
Solution
Let's try the prediction function on one example using random parameters. We will use the above-defined function proba_to_class to compute the argmax of the probabilities.
D = x_train.shape[1] # dimension of the input vector
N = x_train.shape[0] # number of training examples
C = y_one_hot_train.shape[1] # number of classes
theta = np.random.randn(C, D) # model parameters
# We add 'None' so that the array shape after slicing is (1, D) and not (D,)
y_prob_train_hat = prediction(x_train[None,1,:], theta)
y_train_hat = proba_to_class(y_prob_train_hat)
print('Class probabilities: ', end="")
print(y_prob_train_hat)
print('\nSum of probabilities: %.1f' % np.sum(y_prob_train_hat.squeeze()))
print('\nArgmax: %d' % y_train_hat)
The cross_entropy function in the next cell computes the averaged cross entropy between targets (encoded as one-hot vectors) and predictions.
Question 3) You have to complete the cross_entropy_grad function that computes the gradient of the cross entropy loss. It takes as input the predictions, an $(N, C)$ array returned by the prediction function; the targets, an $(N, C)$ array of one-hot labels; and the inputs x, an $(N, D)$ array. It should return a $(C, D)$ array containing the partial derivatives of the cross entropy loss w.r.t. each scalar parameter.
Solution
def cross_entropy(predictions, targets, epsilon=1e-12):
    """
    Computes the averaged cross entropy between targets (encoded as one-hot vectors)
    and predictions.
    Input: predictions (N, C) array
           targets (N, C) array
           epsilon: small constant used to avoid taking the log of 0
    Returns: scalar
    """
    # Clip the predicted probabilities away from 0 and 1 for numerical stability
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    ce = -np.sum(targets*np.log(predictions))/N
    return ce
def cross_entropy_grad(predictions, targets, x):
    """
    Computes the gradient of the cross entropy loss
    Input: predictions (N, C) array
           targets (N, C) array
           x (N, D) array
    Returns: gradient (C, D) array
    """
    return
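As a quick check (using the illustrative values from the worked example above), the given cross_entropy function reproduces the hand computation: with a single prediction $(0.7, 0.2, 0.1)$ and true class $1$, the loss should be $-\ln 0.7 \approx 0.357$.
print(cross_entropy(np.array([[0.7, 0.2, 0.1]]), np.array([[1, 0, 0]])))  # ~0.357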
Let's try your implementation.
grad = cross_entropy_grad(y_prob_train_hat, y_one_hot_train[None,1,:], x_train[None,1,:])
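Optionally, you can sanity-check your analytic gradient against a centered finite difference. The helper below is not part of the lab code and assumes that prediction and cross_entropy_grad have been completed; both printed numbers should be close.
def finite_diff_check(theta, x, targets, j, d, eps=1e-6):
    """Centered finite-difference estimate of the cross entropy w.r.t. theta[j, d]."""
    theta_plus, theta_minus = theta.copy(), theta.copy()
    theta_plus[j, d] += eps
    theta_minus[j, d] -= eps
    ce_plus = cross_entropy(prediction(x, theta_plus), targets)
    ce_minus = cross_entropy(prediction(x, theta_minus), targets)
    return (ce_plus - ce_minus) / (2 * eps)
print(grad[0, 0])
print(finite_diff_check(theta, x_train[None,1,:], y_one_hot_train[None,1,:], 0, 0))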
Question 4) You have to complete the next cell to implement the gradient descent learning algorithm using the above-defined functions: prediction, cross_entropy_grad, cross_entropy, and proba_to_class.
After each epoch or iteration, you will compute the cross entropy on the training and validation sets and append it to the lists ce_train and ce_val. You will also compute the accuracy using the accuracy_score function of scikit-learn and append it to the lists acc_train and acc_val.
You will implement early stopping to automatically stop the iterative learning algorithm when the validation cross-entropy loss has not decreased for a certain number of epochs, specified by the patience variable.
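Early stopping can be implemented with a simple counter. Here is a generic sketch with a made-up toy loss, to be adapted inside your training loop below:
best_loss, wait = np.inf, 0
for epoch in range(100):
    loss = 1.0 / (epoch + 1) if epoch < 30 else 0.05  # toy loss that stops improving at epoch 30
    if loss < best_loss:
        best_loss, wait = loss, 0   # improvement: remember it and reset the counter
    else:
        wait += 1                   # no improvement this epoch
    if wait >= 10:                  # here 10 plays the role of the patience variable
        print('stopping at epoch', epoch)
        break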
Solution
ce_train = []
acc_train = []
ce_val = []
acc_val = []
n_iter = 5000 # number of iter. for GD algo.
patience = 10 # early stopping patience
for epoch in np.arange(n_iter):
    pass
In the next cell, we plot the loss function and the accuracy across epochs for the training and validation sets.
plt.figure()
plt.plot(ce_train)
plt.plot(ce_val)
plt.title('cross entropy loss')
plt.xlabel('iterations')
plt.legend(['train', 'val'])
plt.figure()
plt.plot(acc_train)
plt.plot(acc_val)
plt.title('accuracy')
plt.xlabel('iterations')
plt.legend(['train', 'val'])
Question 5) What do you observe?
Answer
Question 6) Compare the final performance on the training and test sets. How does the performance vary if you vary the number of examples in the training dataset?
Answer